A Japanese-English Technical Lexicon for Translation and Language Research
Abstract
In this paper we present a Japanese-English bilingual lexicon of technical terms. The lexicon was derived from the first and second NTCIR evaluation collections for research into cross-language information retrieval for Asian languages. While it can be used for translation between Japanese and English, the lexicon is also suitable for language research and language engineering. Because it is collection-derived, it contains instances of word variants and misspellings, which make it well suited to further research. For a subset of the lexicon we make available the collection statistics. In addition, we make available a Katakana subset suitable for transliteration research.

1. NTCIR Cross-Language Retrieval

NTCIR is a large evaluation initiative for Asian-language search and question answering, currently in its seventh evaluation. NTCIR is similar in scope to the TREC series of evaluations for English and to CLEF, the Cross Language Evaluation Forum, a large European evaluation initiative dedicated to cross-language retrieval for European languages (Peters et al., 2007). NTCIR was developed to meet the need for cross-lingual and multilingual retrieval research specifically for East Asian languages (Chinese, Japanese and Korean).

The first and second NTCIR Workshops utilized a collection of abstracts from the journal proceedings of 66 Japanese technical societies. As such, the NTCIR-1 and NTCIR-2 collections are the only evaluation resources available to test automatic retrieval of scientific and technical documents in Japanese. Further details about NTCIR-1 may be found in (Kando et al., 1999). Later NTCIR workshops utilized news collections from newspapers and newswire services and expanded the language scope to Japanese, Chinese and Korean.
In this paper we are concerned with deriving a lexicon of technical terminology that can be used for both translation and language engineering, in support of further research into finding technical content across the English and Japanese languages.

2. NTCIR Test Collections

Our lexicon is derived from the NTCIR-1 and NTCIR-2 workshop test collections. The collections consist of three disjoint sub-collections:

• NTCIR-1 J-E gakkai collection (339,483 documents): author abstracts of articles from conferences hosted by 65 Japanese scientific societies for the period 1988-1992 (English and Japanese abstracts pre-joined, where English abstracts are available).
• NTCIR-2 J-E gakkai collection: extension of the NTCIR-1 collection for the years 1997-1999. 77,433 English abstracts and 116,177 Japanese abstracts, as independent files (not pre-joined).
• NTCIR-2 J-E kaken collection: abstracts of final reports of funded research, 1988-1997. 57,545 English abstracts and 287,071 Japanese abstracts, as independent files (not pre-joined).

Footnotes:
1 NTCIR: http://research.nii.ac.jp/ntcir/
2 TREC: http://trec.nist.gov
3 CLEF: http://www.clef-campaign.org

2.1 NTCIR-1 J-E Collection

The NTCIR-1 J-E collection consists of 339,483 documents, of which 334,515 (98.5%) have Japanese abstracts and only 188,907 (55.6%) have equivalent English abstracts. The salient characteristic of the collection, however, is that 313,673 (92.3%) of the documents have author-assigned keywords in both Japanese and English. The following is an example of assigned keywords:

画像センサ // コンピュテーショナルセンサ // 画像圧縮 // 画像符号化
Image Sensors // Computational Sensors // Image Compression // Image Coding

Because only slightly more than half of the documents have English abstracts, pairing keywords may be more useful for aligning term pairs than the more complicated task of pairing sentences in documents (the usual approach in statistical machine translation).
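The keyword pairing described above can be illustrated with a minimal Python sketch. It is not the authors' extraction procedure: it assumes that keywords are separated by "//" as in the example, that the Japanese and English lists align by position, and that records with unequal keyword counts are simply skipped. The function names and the Katakana test (every character in the Unicode block U+30A0 to U+30FF) are hypothetical choices made for illustration.

```python
import re

# Assumption: keywords in a record are separated by "//", as in the example above.
SEPARATOR = "//"

# The Unicode Katakana block (U+30A0-U+30FF) includes the long-vowel mark.
KATAKANA_ONLY = re.compile(r"^[\u30A0-\u30FF]+$")


def split_keywords(field: str) -> list[str]:
    """Split a keyword field on the separator and trim surrounding whitespace."""
    return [kw.strip() for kw in field.split(SEPARATOR) if kw.strip()]


def pair_keywords(japanese_field: str, english_field: str) -> list[tuple[str, str]]:
    """Pair keywords by position; return an empty list when the counts differ."""
    ja = split_keywords(japanese_field)
    en = split_keywords(english_field)
    if len(ja) != len(en):
        return []  # hypothetical policy: skip records that cannot be aligned 1:1
    return list(zip(ja, en))


def is_katakana(term: str) -> bool:
    """True when every character of the term lies in the Katakana block."""
    return bool(KATAKANA_ONLY.match(term))


pairs = pair_keywords(
    "画像センサ // コンピュテーショナルセンサ // 画像圧縮 // 画像符号化",
    "Image Sensors // Computational Sensors // Image Compression // Image Coding",
)

# Candidate entries for a Katakana (transliteration) subset.
katakana_pairs = [(ja, en) for ja, en in pairs if is_katakana(ja)]

for ja, en in pairs:
    print(f"{ja}\t{en}")
```

On the example record, only the pair for "Computational Sensors" has a Japanese side written entirely in Katakana; the mixed Kanji-Katakana term 画像センサ does not pass the filter.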
2.2 NTCIR-2 J-E Gakkai Collection

The NTCIR-2 J-E Gakkai collection is an extension of the NTCIR-1 collection for the additional years 1997-1999. Because the collection covered only two years, it was smaller than NTCIR-1, consisting of slightly more than 116,000 documents, of which only about 77,000 (66.6%) had English abstracts and/or English keywords. Of these, 71,839 documents had both English and Japanese keywords assigned by the authors. In order to extract a lexicon, the two independent files (Japanese abstracts and English abstracts) had to be joined on a common document identification number; the NTCIR-1 collection, by contrast, had pre-joined abstracts in the two languages.

2.3 NTCIR-2 J-E Kaken Collection

The NTCIR-2 J-E Kaken collection consists of abstracts of final reports for academic research funded by the Japanese government between 1988 and 1997. The two independent files contained 287,071 Japanese abstracts and 57,545 English abstracts, which were again joined to create a bilingual abstract subset of 57,512 records with both Japanese keywords and English keywords. The Kaken collection exhibited considerably more diversity in subject matter, as well as less direct correspondence between the English and Japanese assigned keywords. Below are two examples of keyword assignments for this collection:

kaken-j-0965522600
|KYWE| environmental issues | mass media | public opinion | social research | content analysis | effects of mass communication | global warming | social psychology
|KYWD| 環境問題 | マスメディア | 世論 | 社会調査 | 内容分析 | マスコミ効果論 | 地球温暖化 |
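The join on a common document identification number can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: each file is assumed to have already been read into a dictionary keyed by a shared identifier, keyword fields are assumed to be "|"-delimited as in the record above, and the second Japanese record is made-up data included only to show an unmatched document being dropped. The names `parse_keyword_field` and `join_on_id` are hypothetical.

```python
def parse_keyword_field(field: str) -> list[str]:
    """Split a '|'-delimited keyword field, dropping empty trailing entries."""
    return [kw.strip() for kw in field.split("|") if kw.strip()]


def join_on_id(
    japanese: dict[str, list[str]], english: dict[str, list[str]]
) -> dict[str, tuple[list[str], list[str]]]:
    """Inner join: keep only documents that have keywords in both files."""
    shared_ids = japanese.keys() & english.keys()
    return {doc_id: (japanese[doc_id], english[doc_id]) for doc_id in sorted(shared_ids)}


# Assumption: both files expose the same identifier for a given document.
japanese_keywords = {
    "0965522600": parse_keyword_field(
        "環境問題 | マスメディア | 世論 | 社会調査 | 内容分析 | マスコミ効果論 | 地球温暖化 |"
    ),
    # Made-up record with no English counterpart; the join drops it.
    "0000000000": parse_keyword_field("画像圧縮 |"),
}
english_keywords = {
    "0965522600": parse_keyword_field(
        "environmental issues | mass media | public opinion | social research | "
        "content analysis | effects of mass communication | global warming | "
        "social psychology"
    ),
}

joined = join_on_id(japanese_keywords, english_keywords)

for doc_id, (ja, en) in joined.items():
    # Unequal counts signal that positional pairing is unreliable for this record.
    print(doc_id, len(ja), len(en), "aligned" if len(ja) == len(en) else "count mismatch")
```

For the record shown above, the Japanese field yields seven keywords and the English field eight, so a purely positional pairing would misalign or lose a term. Comparing counts before pairing is one simple way to flag such records for separate handling.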
Publication date: 2008